There are many challenges that prevent existing data from being found and reused. Hence, understanding how researchers and support professionals discover and use data may facilitate future data reuse. Additionally, analyzing the factors limiting researchers’ ability to reuse data can help research support professionals better understand how to assist researchers looking for data.
The data from our project was generated from a globally distributed survey. The goal of the survey was to investigate the habits of reusing and sharing at a larger scale. The data set contains 1677 responses from 105 countries and 31 unique disciplines.
The data that was collected consisted of only categorical variables. Additionally, there were multiple questions in the survey that included an open-ended response.
The types of analyses we carried out included bar charts, histograms, pie charts, classification trees, word clouds, etc.
The remainder of this report is structured as follows:
• In Section 1, we will examine and present visuals on the support data set. It will cover the basic EDA on the most important variables.
• In Section 2, we will cover the researchers data set.
• The final section, Section 3, will show and explain all analyses comparing the two data sets (support vs. researchers).
Some of our major findings were that respondents were more likely to encourage data sharing than data reuse, that respondents from the Middle East and Africa made up a greater share of those who discouraged data sharing than of the respondent group as a whole, and that data is most often used as the basis for a new study.
Citation for the data:
Gregory. (2020). Data discovery and reuse practices of researchers and research support professionals. [Data set]. DANS-EASY. DOI.
Citation for the additional publication that used the data:
Gregory, K., Groth, P., Scharnhorst, A., & Wyatt, S. (2019). Lost or found? Discovering data needed for research. arXiv preprint arXiv:1909.00464.
The data contains 1677 total responses from 105 countries and 31 disciplines.
The support dataset contains 47 respondents and 167 variables. Respondents include librarians, archivists, and research/data support providers. The researchers dataset contains 1630 respondents and 165 variables. Respondents include researchers, students, managers, and “other”, for which individuals indicated their role in an open-response field. Open-response answers included professors, engineers, educators, and physicians, among others.
Note that the support dataset has a relatively small sample size, so any conclusions drawn from it should be interpreted with that limitation in mind.
The support dataset comprises respondents whose roles include librarian, archivist, and research/data support provider.
We begin with a general overview of the respondents, then progress to more specific analysis of the variables and significance testing where relevant.
The following shows the count of each group of people whom the respondents support:
| Variable | Count |
|---|---|
| whosupprt_stud | 35 |
| whosupprt_res | 44 |
| whosupprt_indus | 7 |
| whosupprt_oth | 9 |
| whosupprt_othresp | NA |
The majority of respondents either support researchers or students.
The following analyses examine the demographics of respondents from the support dataset (experience, discipline, and country).
Almost half of the respondents in the support dataset have 6-15 years of experience. It should be noted that there are only 3 respondents with 31+ years of experience in this dataset.
Based on the support dataset, the most common disciplines are part of natural and applied sciences, such as information science, environmental sciences, and social science. This may be because new research data within the science field may not be as readily available compared to other disciplines.
The most common countries of employment among respondents in the support data are the USA and the United Kingdom. When countries are grouped by continent, Europe is the most common, followed by North America.
The following examines what kinds of data the research support professionals need:
## need_open need_obs need_exp need_sim need_deriv need_oth
## 32 39 19 18 24 8
## need_othresp
## 8
##
## need_obs need_open need_deriv need_exp need_sim need_oth
## 0.26351351 0.21621622 0.16216216 0.12837838 0.12162162 0.05405405
## need_othresp
## 0.05405405
Based on the bar graph, observational or empirical data was needed the most among the respondents in the support data.
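The proportions in the output above can be recomputed from the counts. The following is a minimal Python sketch, not the code used to produce the report. It assumes the proportions are shares of all selections (the counts sum to 148, more than the 47 respondents, so respondents evidently could select several data types). The helper name `sorted_proportions` is illustrative.

```python
# Counts copied from the output above (support dataset, data needs).
need_counts = {
    "need_open": 32, "need_obs": 39, "need_exp": 19, "need_sim": 18,
    "need_deriv": 24, "need_oth": 8, "need_othresp": 8,
}


def sorted_proportions(counts):
    """Return (name, share) pairs sorted from largest to smallest share.

    Shares are relative to the total number of selections, not to the
    number of respondents (an assumption inferred from the printed values).
    """
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


need_shares = sorted_proportions(need_counts)
for name, share in need_shares:
    print(f"{name:14s} {share:.8f}")
```

Dividing by the 47 respondents instead would give the share of respondents needing each data type, which answers a different question.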
Conclusions:
In relation to data needs, the researchers and support datasets are very similar: observational or empirical data is the most needed type across the majority of demographic categories in both datasets.
Comparing the 4 main data needs variables (observational or empirical data, experimental data, simulation data, derived or compiled data) to respondents’ years of experience, we see that the graphs are relatively similar across most of the categories. However, the percentage of respondents needing simulation data is substantially greater in the 31+ years of experience group than in the other 3 experience categories. This may suggest that more experienced respondents are more likely to need simulation data, for example because they need large quantities of data or use simulations for more advanced studies. Since only 3 respondents in the support dataset have 31+ years of experience, this difference should be interpreted with caution.
The following examines why the respondents or the people they support use or need secondary data, starting with the count of different uses and followed by a graph.
## use_nwstdy use_calb use_bnchmrk use_vrfctn use_inpt use_idea
## 38 12 16 20 19 26
## use_tch use_nwprj use_nwmth use_trnds use_cmprsn use_smvs
## 37 29 19 22 28 25
## use_intgrtn use_oth use_othresp
## 27 1 1
The 3 most common purposes for which data is used are, in order, as the basis for a new study, for teaching/training, and to prepare for a new project or proposal.
Conclusions:
Comparing the 4 main data use variables (basis for a new study, teaching/training, new project or proposal, compare multiple datasets) to respondents’ years of experience, we see that using data for teaching or training is very common across all experience levels. This is reasonable because it can be assumed that respondents with lower levels of experience may be using data for training while more experienced respondents may be teaching others. When it comes to respondents with 31+ years of experience, they were much more likely to use data for other reasons, such as preparing for a new project, rather than as a basis for a new study.
Comparing data use to respondents’ disciplines, we can observe that using data for teaching or training occupies a large percentage in all of the disciplinary subsets. This may emphasize that data is frequently used to teach or train no matter what discipline one may be in.
The following examines how research support professionals find their data.
## find_actonln find_serendsrch find_serendpas find_share find_netwk
## 47 47 47 47 33
## find_creatr find_collab find_conf find_list
## 12 14 29 31
The most common way the respondents find their data is by actively searching online.
Conclusion:
The most common way the respondents find their data is by actively searching online, as shown by the first column in the graph above. Although finding data serendipitously happens less often, it still occurs occasionally.
The following examines the sources through which the respondents discover their data:
##
## find_netwk find_list find_conf find_collab find_creatr
## 0.2773109 0.2605042 0.2436975 0.1176471 0.1008403
The most common source used to discover data is via conversations with personal networks, followed by via mailing lists or forums and via conferences.
Conclusion:
When asked how they or the people they support discover data, respondents most often named conversations with personal networks, followed by mailing lists or forums and then attending conferences.
Conclusions:
Looking at the ways the respondents found data based on their experience levels, we see that almost half of those with 0-5 years of experience found their data with conversations with personal networks. However, as experience levels increased, the percentage of data found via the networks decreased dramatically. This may suggest that more experienced respondents are more likely to look for and find data outside of their initial network, such as using mailing lists or attending conferences instead.
Comparing all the individual graphs, we can see that finding data via conversations with personal networks is relatively common among all the disciplines.
Outside of North America and Europe, respondents from other regions seem to find data through a limited number of sources. For example, none of the respondents from Australia/New Zealand or South/Central America indicated that they find data by attending conferences or by developing collaborations. However, the limited number of responses from those regions should be taken into consideration.
Conclusions: 1. The most used sources are government sources, literature, and search engines. The least used are commercial sources.
##
## 1-sample proportions test with continuity correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reusegrp, null probability 0.5
## X-squared = 3.7812, df = 1, p-value = 0.05183
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4986377 0.8325051
## sample estimates:
## p
## 0.6875
##
## 1-sample proportions test with continuity correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reusedisc, null probability 0.5
## X-squared = 3.7812, df = 1, p-value = 0.05183
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4986377 0.8325051
## sample estimates:
## p
## 0.6875
##
## 1-sample proportions test with continuity correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reuseorg, null probability 0.5
## X-squared = 1.3611, df = 1, p-value = 0.2433
## alternative hypothesis: true p is not equal to 0.5
## 95 percent confidence interval:
## 0.4352665 0.7637567
## sample estimates:
## p
## 0.6111111
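The three test outputs above can be checked by hand. The sketch below is a Python approximation of a one-sample proportion test with Yates' continuity correction. It is not the report's own code, which appears to have been run in R. The counts (22 of 32 and 22 of 36) are an assumption: they are back-calculated from the printed estimates (0.6875 and 0.6111111) and test statistics and do not appear in the source.

```python
import math


def one_sample_prop_test(successes, n, p0=0.5):
    """Chi-squared test of one proportion with Yates' continuity correction.

    Returns (test statistic, two-sided p-value) on 1 degree of freedom.
    """
    expected = n * p0
    # The correction is capped so it never exceeds the observed deviation.
    correction = min(0.5, abs(successes - expected))
    stat = (abs(successes - expected) - correction) ** 2 / (n * p0 * (1 - p0))
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# ASSUMPTION: counts inferred from the printed output, not given in the report.
stat_grp, p_grp = one_sample_prop_test(22, 32)
stat_org, p_org = one_sample_prop_test(22, 36)

print(f"22 of 32: X-squared = {stat_grp:.4f}, p-value = {p_grp:.5f}")
print(f"22 of 36: X-squared = {stat_org:.4f}, p-value = {p_org:.4f}")
```

Under these assumed counts, none of the three tests rejects the null proportion of 0.5 at the 5% level, although the first two are borderline.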
In this section, we investigate the researchers data set, which contains survey answers from respondents who are researchers, students, managers, and individuals who indicated their roles in an open-response field (professors, educators, physicians, and engineers, to name a few).
This section begins with some overviews of the respondents in the data set, followed by more specific analysis through investigating certain variables grouped by the demographic characteristics of the respondents.
Most of the respondents from this dataset are researchers.
Similar to the support dataset, the most common experience level is 6-15 years, followed by 16-30 years.
Similar to the support dataset, respondents who specialize in the natural and applied sciences greatly outnumber the other disciplinary subsets.
The most common country of employment among respondents in this dataset is the USA. By continent, respondents are most often employed in Europe, North America, and Asia, in that order.
We explore what respondents need data for in barplots below.
The greatest area of need for respondents is observational or empirical data, followed by experimental data.
We then group respondents’ data needs by discipline group. [Note: depending on how the disciplines are grouped, the results of the barplot could change.]
Observational or empirical data is the most needed type in every discipline group, with the highest percentage in the Social Sciences, where over 50% of respondents need that type of data.
It appears that observational or empirical data is most needed, followed by experimental data, across all of the discipline groups. Overall, the trends in data need are as expected for each discipline group: respondents in the natural sciences have the greatest proportion needing experimental data, respondents in business have the greatest proportion needing simulation (model) data, and respondents in the humanities have the greatest proportion needing derived or compiled data. As mentioned above, these results may differ if the original disciplines from the survey were grouped differently.
We then explore trends in who finds data for the respondents.
Over 50% of respondents find data themselves, and almost 25% find data through someone in their personal network.
We then split by experience group below.
The majority of researchers in all experience groups find data themselves followed by finding data through someone in their personal network. Excluding the group other, the proportion of respondents in the 0-5 experience group who find data from graduate students is the lowest, and the proportion of respondents in the other experience groups who find data from research support professionals is the lowest.
It appears that across all experience groups, respondents indicated they most often find data themselves. The percentage of each experience group who indicate they find data themselves decreases as experience increases, from nearly 60% of respondents in the 0-5 group to less than 50% of the 31+ group. This may indicate that more experienced respondents can outsource the data finding process to research assistants and graduate students, or that they are more aware of support professionals who may aid in finding data.
We see similar levels of finding data through someone in their personal network and through research support professionals across all experience groups.
Next, we explore the challenges to finding data.
Data not being accessible is the most common challenge to finding data, followed by data being in many different places.
We then split by experience group to better understand the challenges experienced at all experience levels.
In all experience groups, data not being accessible is the most common challenge among respondents, followed by data being in many different places. Excluding the challenge “other”, not having the necessary personal networks appears to be the least common challenge throughout the experience groups. A smaller percentage of respondents in the 31+ experience group consider data not being accessible a challenge compared to the rest of the groups.
It appears that the challenges to finding data are similar across all experience groups, with data not being accessible and data being in many different places as the first and second most common challenges. Nearly 30% of respondents in the 0-5 experience group indicated inaccessible data as a challenge, compared with less than 25% of respondents in the 31+ experience group. In contrast, the percentage of respondents indicating that data are in many different places increases as experience increases. This trend may indicate that the more experience a respondent has, the easier it becomes to navigate resources to find data, while also revealing that the necessary data are inconveniently spread across many different places. It could point to a need for better resources that make data accessible to less experienced researchers, and for better consolidation or organization of data.
We explore some of the challenges by isolating the respondents who indicated they face a certain challenge and plotting other variables. The most interesting trends are shown below.
Relationship between data source and respondents who indicated they did not know where or how to find data [note that some sources did not show much difference and were not included in the plots. Only sources which had interpretable trends are shown below.]:
First, we isolated respondents who indicated they don’t know where or how to find data and compared their frequency of use of consultation with research support, data specific search engines, discipline-specific data repositories and general search engines with the frequency of use of the entire respondent group.
A smaller proportion of respondents who indicated not knowing where or how to find data as a challenge use consultation with research support, data-specific search engines, and discipline-specific data repositories.
A greater percentage of respondents in the don’t know where or how to find data group indicated they never use consultation with research support, compared to the entire group. The percentage indicating they never use data-specific search engines and discipline-specific data repositories is also higher in this group than in the entire group, while the percentage using these sources often is lower. These changes appear to indicate that respondents who do not know where or how to find data use targeted sources (i.e., data-specific or discipline-specific search engines and repositories, and support professionals) less than the entire respondent group, and use general search engines such as Google more.
Next we compare the trends in the frequency of using discipline-specific data repositories, multidisciplinary data repositories, governmental agencies and websites, and professional associations between respondents who indicated they think online tools are inadequate and the entire respondent group [note that only sources which had interpretable trends are shown in the plots below]:
Respondents who indicated a challenge to finding data is inadequate online tools appear to use the following sources more than the entire respondent group.
[Note: since the differences in the plot are slight, the percentages represented in the plot are listed below for a clearer comparison.]
Discipline specific data repository: all respondents
##
## Never Occasionally Often
## 0.333 0.333 0.333
Discipline specific data repository: inadequate online tools
##
## Never Occasionally Often
## 0.173 0.402 0.425
Multidisciplinary data repositories: all respondents
##
## Never Occasionally Often
## 0.333 0.333 0.333
Multidisciplinary data repositories: inadequate online tools
##
## Never Occasionally Often
## 0.289 0.501 0.210
Government agencies and websites: all respondents
##
## Never Occasionally Often
## 0.333 0.333 0.333
Government agencies and websites: inadequate online tools
##
## Never Occasionally Often
## 0.180 0.455 0.365
Professional associations: all respondents
##
## Never Occasionally Often
## 0.333 0.333 0.333
Professional associations: inadequate online tools
##
## Never Occasionally Often
## 0.365 0.461 0.175
It appears that respondents who indicated inadequate online tools as a challenge to finding data use discipline-specific data repositories, multidisciplinary data repositories, government agencies and websites, and professional associations more than the entire respondent group. This could point to those sources lacking adequate online tools for researchers to find data.
Note that other sources such as general search engine, data specific search engines, code repositories etc. were not included because there was very little difference in percentage of frequency of use between all respondents and those who indicated inadequate online tools as a challenge.
We explore the ease of finding data between students and researchers. Additionally we look at the challenges the students and researchers face when finding data.
Ease of finding data for students
##
## Difficult Easy Sometimes challenging
## 9 3 61
##
## Difficult Easy Sometimes challenging
## 0.1233 0.0411 0.8356
Ease of finding data for researchers
##
## Difficult Easy Sometimes challenging
## 260 124 988
##
## Difficult Easy Sometimes challenging
## 0.1895 0.0904 0.7201
A barplot of the levels of ease in finding data is shown below:
It appears that more students than researchers indicated that data is sometimes challenging to find: 84% of students compared to 72% of researchers. However, more researchers (19%) than students (12%) indicated that data is difficult to find. A greater proportion of researchers (9%) than students (4%) also indicated that finding data was easy.
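The percentages quoted in this comparison follow directly from the two count tables. The sketch below, in Python rather than the R that the report's output suggests, recomputes the within-group shares and the researcher-minus-student difference for each level. The names `ease_counts` and `row_shares` are illustrative.

```python
# Counts copied from the two "ease of finding data" tables above.
ease_counts = {
    "students": {"Difficult": 9, "Easy": 3, "Sometimes challenging": 61},
    "researchers": {"Difficult": 260, "Easy": 124, "Sometimes challenging": 988},
}


def row_shares(counts):
    """Convert one group's counts into shares of that group's total."""
    total = sum(counts.values())
    return {level: n / total for level, n in counts.items()}


ease_shares = {role: row_shares(counts) for role, counts in ease_counts.items()}

for level in ease_counts["students"]:
    student = ease_shares["students"][level]
    researcher = ease_shares["researchers"][level]
    print(f"{level:22s} students {student:.4f}  researchers {researcher:.4f}  "
          f"difference {researcher - student:+.4f}")
```

Because only 73 students answered this question compared with 1372 researchers, the student shares are far less precise. The 4% who found data easy corresponds to just 3 students.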
We explore the challenges students and researchers indicated below.
Most students and researchers indicate that data are not accessible, followed by data being in many different places.
There do not appear to be dramatic differences in the challenges to finding data faced by students and researchers. For both students and researchers, the greatest challenge is that data are not accessible, followed by data being in many different places. Interestingly, the percentage of researchers who indicated they don’t have the necessary personal networks or don’t know where or how to look for data is lower than that of students. This makes sense, as researchers may have more experience than students and thus would be expected to have a richer personal network and more skill in finding data.
The proportion of researchers who indicated that online tools are inadequate and that data are not digital is greater than that of students. These challenges could suggest that resources for finding data have not kept up with the transition of data to digital formats or online platforms.
We then look at the frequency of using certain sources to find data.
Academic literature and general search engines appear to be the most popular data sources with nearly 80% of respondents indicating they often use academic literature and nearly 70% of respondents indicating they often use general search engines. Code repositories and commercial sources appear to be the least popular data sources with just under 80% and just under 70% of respondents indicating they never use those sources respectively.
We now explore the frequency of use of certain data sources by respondent experience level [note that some sources are omitted because there was no interesting difference between the trends in use of all the respondents compared to respondents of different experience levels].
As experience level increases, the frequency of using Academic Literature increases as well, from under 60% of respondents in the 0-5 group indicating they often use Academic Literature to nearly 80% of respondents in the 31+ group.
In the 0-5 and 6-15 experience groups, similar percentages of respondents indicated they never or occasionally use Consultation with Research Support to find data, while in the 16-30 and 31+ groups a greater percentage indicated they occasionally use this source than indicated they never use it.
The greatest percentage of respondents who indicate they often use General Search Engines (e.g., Google) to find data occurs in the 0-5 group, while the percentage in the other groups remains relatively stable at just under 60%.
It appears that the frequency of using these sources to find data, broken down by experience group, does not differ greatly from the trends across all respondents. For Academic Literature, the percentage of respondents who indicated they often use the source increases with experience level, perhaps due to the level of expertise needed to understand academic literature. For General Search Engines (e.g., Google), the percentage of respondents who indicated they often use the source decreases as experience level increases, perhaps due to knowledge of more effective and reliable sources such as academic literature or discipline-specific resources. For Consultation with Research Support, the percentage of respondents who indicate they occasionally use this source increases with experience level, while the percentage who indicate they never use it decreases. This may be due to greater access to support professionals among more experienced respondents, and more confidence in seeking out professional support.
We investigate trends in how data is used below.
Data is most commonly used for new studies followed by new projects and teaching and training. Excluding the variable other, data is used least for calibrating instruments or models.
Note that since data is most commonly used for new studies followed by preparing for a new project or proposal and then to generate new ideas, it appears data is mostly used to create or discover new things. This could be taken into account when interpreting the analyses on data sharing and reusing later in the report.
We then split data use by respondent discipline group.
Using data for the basis of a new study appears to be the most popular way to use data across all discipline groups.
Across discipline groups, data is used most as the basis for a new study, with the highest percentage of respondents in the Humanities and Arts discipline group at just over 15%. In the Natural Science discipline group, the second highest percentage of respondents indicated they use data to verify their own data, perhaps signaling the importance of repeated findings for validating the accuracy of researchers’ data. For respondents in the Healthcare and Medicine and Multidisciplinary groups, the second highest percentage indicated using data to prepare for a new project or proposal, while the second highest percentage of respondents in the CS and Engineering group indicated using data as model, algorithm, or system inputs and to experiment with new methods or techniques.
We explore the respondents’ personal perception of data sharing and reusing below:
Personal perception of data sharing
##
## Data sharing is neither encouraged nor discouraged
## 151
## Data sharing is somewhat discouraged
## 30
## Data sharing is somewhat encouraged
## 440
## Data sharing is strongly discouraged
## 10
## Data sharing is strongly encouraged
## 972
## Don't know/ Not applicable
## 27
##
## Data sharing is neither encouraged nor discouraged
## 0.0926
## Data sharing is somewhat discouraged
## 0.0184
## Data sharing is somewhat encouraged
## 0.2699
## Data sharing is strongly discouraged
## 0.0061
## Data sharing is strongly encouraged
## 0.5963
## Don't know/ Not applicable
## 0.0166
Personal perception of data reusing
##
## Data reusing is neither encouraged nor discouraged
## 214
## Data reusing is somewhat discouraged
## 50
## Data reusing is somewhat encouraged
## 517
## Data reusing is strongly discouraged
## 44
## Data reusing is strongly encouraged
## 751
## Don't know/ Not applicable
## 54
##
## Data reusing is neither encouraged nor discouraged
## 0.1313
## Data reusing is somewhat discouraged
## 0.0307
## Data reusing is somewhat encouraged
## 0.3172
## Data reusing is strongly discouraged
## 0.0270
## Data reusing is strongly encouraged
## 0.4607
## Don't know/ Not applicable
## 0.0331
We now look more closely at respondents who indicated that, in their perception, data sharing is “somewhat discouraged” or “strongly discouraged”.
The geographical regions of respondents are shown below, first for the entire respondent group and then for the 40 respondents who perceive data sharing as somewhat or strongly discouraged:
##
## Africa Asia Australia/New Zealand
## 0.05 0.17 0.02
## Europe Middle East North America
## 0.41 0.04 0.19
## South/Central America
## 0.11
##
## Africa Asia Australia/New Zealand
## 0.10 0.10 0.03
## Europe Middle East North America
## 0.38 0.15 0.17
## South/Central America
## 0.07
The percentage of respondents from the Middle East (15%) and Africa (10%) among those who indicated data sharing is discouraged is greater than the percentage of respondents from those regions in the entire respondent group (4% and 5% respectively). This could potentially indicate regional differences in the perception of data sharing.
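One way to make this comparison explicit is to divide each region's share in the discouraging group by its share in the entire respondent group. The sketch below is a Python illustration, not part of the original analysis. It assumes the first table above is the entire group and the second is the 40 discouraging respondents, which matches the percentages quoted in the text. The shares are the rounded two-decimal values printed above, so the ratios are approximate.

```python
# Regional shares copied from the two tables above (rounded to 2 decimals).
# ASSUMPTION: first table = all respondents, second = the 40 respondents
# who perceive data sharing as discouraged.
overall = {
    "Africa": 0.05, "Asia": 0.17, "Australia/New Zealand": 0.02,
    "Europe": 0.41, "Middle East": 0.04, "North America": 0.19,
    "South/Central America": 0.11,
}
discouraging = {
    "Africa": 0.10, "Asia": 0.10, "Australia/New Zealand": 0.03,
    "Europe": 0.38, "Middle East": 0.15, "North America": 0.17,
    "South/Central America": 0.07,
}

# A ratio above 1 means the region is over-represented in the subgroup.
ratio = {region: discouraging[region] / overall[region] for region in overall}
over_represented = sorted(
    (region for region in ratio if ratio[region] > 1.0),
    key=ratio.get,
    reverse=True,
)

for region in over_represented:
    print(f"{region:22s} ratio {ratio[region]:.2f}")
```

With only 40 respondents in the subgroup, a share of 0.03 corresponds to roughly one person, so ratios for small regions such as Australia/New Zealand carry little weight.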
Below is the perception of data sharing in larger scales such as among the work group, disciplinary community and at an organizational level:
It appears that the percentage of respondents who indicate data sharing is discouraged in their disciplinary community, organization, or work group is dramatically lower in the entire survey group than among respondents who indicated they discourage data sharing. This is not surprising, but the fact that the biggest difference occurs in the work group may indicate that the work those respondents engage in heavily influences their own perception of data sharing. Perhaps these respondents deal with sensitive information in their work, for example at government agencies or secretive companies.
Below is the perception of data reuse on the same levels: work group, disciplinary community and at an organizational level.
Although higher percentages of respondents in the discourage data sharing group indicated they discourage data reuse across all levels, more of them indicated they somewhat encourage or neither encourage nor discourage data reuse, in contrast to the majority who indicated they discourage data sharing. This could indicate that this group of respondents is primarily concerned about sharing data and, more specifically, that they perceive their work group as the least enthusiastic about sharing data.
To begin, we explore 4 different factors. The first 2 fall under perception of reuse and sharing, while the next 2 are variables transformed from the given data.
1. avg_share (numerical) refers to the mean share across 4 types of sharing for each respondent.
2. avg_reuse (numerical) refers to the mean reuse across 4 types of reuse for each respondent.
3. disc_total (numerical) refers to the total number of disciplines reported by each individual. It is taken as a proxy measure of how multidisciplinary one’s work might be. It is a numerical value from 1 to 20.
4. total_chal (numerical) refers to the total number of challenges reported by each individual when finding data. Each challenge is coded as “1” for “Yes” and “0” for “No”, and these values are summed into a new numerical variable, “total_chal”.
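The derived variables described above could be built along the following lines. This is a minimal Python sketch under stated assumptions, not the report's implementation. The four reuse column names are taken from the test output earlier in the report. The challenge column names (`chal_access`, `chal_places`, `chal_tools`) and the numeric coding of the perception answers are hypothetical.

```python
from statistics import mean

# Reuse perception columns, named as in the test output earlier in the report.
REUSE_COLS = ["dem_reuseself", "dem_reusegrp", "dem_reusedisc", "dem_reuseorg"]
# HYPOTHETICAL challenge column names, used only for illustration.
CHALLENGE_COLS = ["chal_access", "chal_places", "chal_tools"]


def add_derived(row):
    """Return a copy of one respondent's record with derived variables added."""
    out = dict(row)
    # avg_reuse: mean of the four reuse perception scores.
    out["avg_reuse"] = mean(row[col] for col in REUSE_COLS)
    # total_chal: each "Yes" counts as 1 and each "No" as 0, then summed.
    out["total_chal"] = sum(1 for col in CHALLENGE_COLS if row[col] == "Yes")
    return out


# ASSUMPTION: perception answers are already coded as numbers.
example = {
    "dem_reuseself": 5, "dem_reusegrp": 4, "dem_reusedisc": 3, "dem_reuseorg": 4,
    "chal_access": "Yes", "chal_places": "No", "chal_tools": "Yes",
}
derived = add_derived(example)
print(derived["avg_reuse"], derived["total_chal"])
```

avg_share would be computed the same way over the four sharing columns, and disc_total by counting the disciplines each respondent selected.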
There are many interesting and important demographics that can be explored. In this section, we will only be focused on a few aspects of demographics.
Role of respondents (dem_role) - Researchers, Students. We will be focusing on two primary group of respondents, as they are most relevant to our research questions and nature of work, given that we are helping our clients at CMU Library with researchers and students
Experience group (dem_exprnce) - These are the 4 experience group provided as used earlier.
Shared (Y/N) (dem_shared) - It refers to whether the respondents have indicated that they have shared data. “Yes” being classified as “sharers” and “No” as “non-sharers”.
To begin, we plot the 4 numerical variables in a scatterplot matrix to identify some of the interesting relationships we might investigate in detail further.
Beginning with the role classification, we explore the following for each type of role: (1) how they view sharing personally, (2) how they view sharing across disciplines, (3) how they view reuse personally, and (4) how they view reuse across disciplines.
##
## Call:
## lm(formula = avg_reuse ~ avg_share, data = main.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.3297 -0.3304 0.2525 0.6696 2.9991
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.00085 0.10336 9.683 <2e-16 ***
## avg_share 0.66577 0.02599 25.616 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9965 on 1628 degrees of freedom
## Multiple R-squared: 0.2873, Adjusted R-squared: 0.2868
## F-statistic: 656.2 on 1 and 1628 DF, p-value: < 2.2e-16
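We regress average reuse on average sharing using simple linear regression. The slope is 0.666 (p < 2e-16), so each one-point increase in average sharing is associated with an increase of roughly 0.67 points in average reuse. The adjusted R-squared is 0.2868, meaning average sharing explains about 29% of the variance in average reuse.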
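To make the fitted line concrete, the sketch below plugs the reported coefficients into the regression equation to get the predicted average reuse score at each whole-number value of average sharing. It only reuses the estimates printed in the output above; the variable names are ours.

```r
# Coefficients copied from the lm() summary for avg_reuse ~ avg_share.
intercept <- 1.00085
slope     <- 0.66577

# Predicted average reuse at whole-number average sharing scores.
avg_share       <- 1:5
predicted_reuse <- intercept + slope * avg_share

print(data.frame(avg_share, predicted_reuse = round(predicted_reuse, 2)))
```

For example, a respondent with an average sharing score of 4 has a predicted average reuse score of about 3.66.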
Conclusions Firstly, researchers and students show roughly similar trends for sharing individually. In both groups, more than 85% indicated that sharing is somewhat or strongly encouraged. One difference is that for researchers, strongly encouraged is a large majority, almost double that of somewhat encouraged, while for students the split is more even.
Secondly, in the figure for sharing by discipline, the distribution is more symmetric than in the first set of plots. Comparing researchers and students, the distributions are again roughly similar.
However, when comparing between sharing types, the strongly encouraged responses are almost halved for sharing within disciplines, and more respondents were neutral or even discouraged than for sharing personally. This is aligned with the overall trend of the dataset.
Firstly, both researchers and students respond with overwhelming positivity about reusing data personally, as seen in the left skewness of the data. This is encouraging to see, especially since only 10% of both students and researchers are discouraged from reusing data. For students, there is also a very large portion of responses indicating they were strongly encouraged to reuse data, more than double the portion that felt only somewhat encouraged.
For reuse within disciplines, the distribution is less skewed and more symmetric than that of personal reuse. Overall, the 2 roles do not show a large difference in how discouraged they are. For researchers, somewhat and strongly discouraged make up less than 10%, and for students they add up to a little more than 10%. However, a far larger percentage of respondents in both groups felt encouraged, and more were neutral about it than about reusing personally.
Main takeaway The role of the respondents accounts for little difference in the data set, since a large majority are researchers (1445 out of 1630). Researchers therefore have the largest influence on the overall trend of the data. Moving on, we do not need to differentiate by role, given that both roles share similar trends on top of the dominant effect of the researcher role.
Moving on, we investigate another aspect of demographics, the level of experience. Firstly, we check the relationship of average reuse vs average sharing across the different experience groups.
Similarly, we will investigate the following: (1) how they view sharing personally, (2) how they view sharing across disciplines, (3) how they view reuse personally, and (4) how they view reuse across disciplines.
Given that one is more likely to share and reuse data individually than in a group, we investigate whether inclinations in a group setting can be used to predict how willing individuals are to share/reuse data individually.
##
## Call:
## lm(formula = dem_sharself ~ dem_shardisc + dem_shargrp + dem_sharorg,
## data = main.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.7970 -0.4345 0.0007 0.4954 2.4743
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.52567 0.07362 34.307 < 2e-16 ***
## dem_shardisc 0.13227 0.02209 5.989 2.6e-09 ***
## dem_shargrp 0.29242 0.02216 13.198 < 2e-16 ***
## dem_sharorg 0.07003 0.01892 3.701 0.000221 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8112 on 1626 degrees of freedom
## Multiple R-squared: 0.2992, Adjusted R-squared: 0.2979
## F-statistic: 231.4 on 3 and 1626 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = dem_reuseself ~ dem_reusedisc + dem_reusegrp + dem_reuseorg,
## data = main.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.2359 -0.3100 0.0154 0.3646 3.4235
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.611550 0.063344 25.441 < 2e-16 ***
## dem_reusedisc 0.133147 0.022610 5.889 4.71e-09 ***
## dem_reusegrp 0.548475 0.023088 23.756 < 2e-16 ***
## dem_reuseorg -0.007012 0.018855 -0.372 0.71
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8545 on 1626 degrees of freedom
## Multiple R-squared: 0.5094, Adjusted R-squared: 0.5085
## F-statistic: 562.7 on 3 and 1626 DF, p-value: < 2.2e-16
Observation The group-setting variables explain more of the variation in individual reuse (adjusted R-squared 0.5085) than in individual sharing (adjusted R-squared 0.2979). This indicates that reusing data in a group setting is perhaps a better indicator of reusing data individually than sharing in a group setting is of sharing data individually.
In other words, the sharing perceptions are much more varied.
Across all types of classification, a common trend for the difference between group and individual sharing/reuse is that individual preferences are more encouraging than those within groups.
This could be due to individuals being more clear and decisive in their own data practices and therefore able to respond more positively when appropriate. In group settings, the data practices or protocols might differ from those of individuals or even have less clarity, as there are differing priorities and projects all the time. The overall data practices and methodology might also encourage less sharing and reuse.
Another possible reason for a lower reuse tendency in group settings is that individuals feel more pressured by their peers and counterparts to collect primary data rather than freely reuse prior data. This leaves us room to explore whether other factors, like data needs, uses or even fields of expertise, affected their inclination.
Next, we explore the 2 other new variables that are transformed from the current data set.
Firstly, we investigate the effect of the total number of disciplines (disc_total) on (1) avg_share, (2) avg_reuse, and (3) total challenges.
Secondly, we investigate total challenges and its effect on average reuse to see if average reuse scores increase when researchers encounter more challenges when finding data.
As previously mentioned, disc_total (numerical) refers to the total number of disciplines reported by each individual; it is taken as a proxy measure of how multidisciplinary one’s work might be. It is a numerical value from 1 to 20. Firstly, we take a look at the distribution of total disciplines.
##
## 1 2 3 4 5 6 7 8 9 10 15
## 816 370 235 123 46 18 9 7 2 3 1
Observation About half of the respondents (816 of 1630) are involved in only 1 discipline, with 370 in 2 disciplines and 235 in 3 disciplines. The histogram is strongly right skewed, with a single mode at 1 discipline.
Evidently, fewer than half of the respondents report more than one discipline. Despite that, we investigate further to see if there are any positive effects on other factors.
Future work One could split the respondents into 2 main groups, multidisciplinary or single discipline, for a better split and distribution of the population.
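As a rough illustration of the suggested split, the counts from the frequency table above can be collapsed into the two groups. The variable names and group labels are our own; this gives 816 single-discipline and 814 multidisciplinary respondents, an almost even split.

```r
# Counts copied from the frequency table of disc_total above.
disc_total <- c(1:10, 15)
counts     <- c(816, 370, 235, 123, 46, 18, 9, 7, 2, 3, 1)

# Collapse into two groups: exactly one discipline vs more than one.
group <- ifelse(disc_total > 1, "multidisciplinary", "single discipline")
split_counts <- tapply(counts, group, sum)

print(split_counts)
print(round(split_counts / sum(counts), 3))
```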
As previously mentioned, total_chal (numerical) refers to the total number of challenges reported by each individual when finding data. Each challenge is coded as “1” for “Yes” and “0” for “No”, and the codes are summed into a new numerical variable.
##
## 0 1 2 3 4 5 6
## 139 269 398 447 250 88 39
Conclusion As seen in the histogram, the distribution of total challenges faced by researchers is relatively symmetrical, with only a slight left skewness and a primary mode at 3 challenges. The 3 most frequently reported numbers of challenges are 3 (447 respondents), 2 (398 respondents) and 1 (269 respondents).
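The summary statistics behind this description can be recomputed from the frequency table alone. The sketch below is an illustration with variable names of our choosing, not the code used for the report. It yields a mean of about 2.50 and a median and mode of 3; a mean below the median is consistent with a slight left skew.

```r
# Counts copied from the frequency table of total_chal above.
n_chal <- 0:6
counts <- c(139, 269, 398, 447, 250, 88, 39)

total       <- sum(counts)
mean_chal   <- sum(n_chal * counts) / total
# Expand the table back into one value per respondent to take the median.
median_chal <- median(rep(n_chal, times = counts))
mode_chal   <- n_chal[which.max(counts)]

print(c(n = total, mean = round(mean_chal, 2),
        median = median_chal, mode = mode_chal))
```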
In the following parts, we explore our 2 new variables against other variables in various bivariate analyses.
After exploring the effect of total number of disciplines on average sharing, we investigate its effect on average reuse and compare the two to see whether there is any significant difference.
##
## Call:
## lm(formula = avg_reuse ~ disc_total, data = main.df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.756 -0.541 0.321 0.709 1.459
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.51034 0.05116 68.617 <2e-16 ***
## disc_total 0.03067 0.02091 1.467 0.143
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.18 on 1628 degrees of freedom
## Multiple R-squared: 0.001319, Adjusted R-squared: 0.0007059
## F-statistic: 2.151 on 1 and 1628 DF, p-value: 0.1427
Conclusion and Comparison With regards to fig __ , there is no significant effect of total number of disciplines on average reuse scores for either group: the slope is not significantly different from zero (p = 0.143), and the adjusted R-squared value of predicting average reuse using total disciplines is only 0.0007059.
By comparison, total disciplines fits average sharing marginally better, with an adjusted R-squared value of 0.001131 against 0.0007059, but both values are negligible. All in all, one can still surmise that the correlation between total number of disciplines and the average sharing and reuse scores is low, given the gentle slope of the plotted line and the low adjusted R-squared values.
Next, we might hypothesize that as one is involved in more disciplines, the challenges faced in finding data could become more complex and thus one might run into more problems. Hence we seek to investigate this.
Observation There is no strong relationship or correlation observed between the two factors. Because most individuals report a low number of disciplines, there is little data at higher discipline counts from which a relationship could emerge.
Next, we would like to see if the number of challenges respondents faced while finding data is associated with an increased propensity or willingness to reuse data. This rests on the assumption that if one has trouble finding a desired data set, one might be more likely to reuse prior data sets that are more readily available.
Observation For non-sharers, the relationship between number of challenges and reuse tendency is more defined and positive than that of the sharers. As they encountered more challenges, they were more likely to report a higher willingness to reuse data.
However, we should also be wary, since the sample size of non-sharers is significantly smaller than that of the sharers: only 232 non-sharers against 1398 sharers.
The classification tree model is used to predict whether one has “shared” using 5 variables that we have mentioned earlier. They are as follows: (1) willingness to share data individually, (2) willingness to share data within disciplines, (3) willingness to reuse data individually, (4) willingness to reuse data within disciplines, and (5) experience group.
Conclusion The classification tree indicates that respondents who indicated a strong inclination to share themselves ( > 4) are highly likely to have shared (level 1, 85%). This is the most important factor in determining whether they have shared data. Likewise, when their experience is between 0-5 and they have expressed a willingness to share, they are also predicted to have shared more often than their counterparts (level 2, 23%).
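A classification tree of this kind can be sketched as below. This is an illustration under stated assumptions rather than the model behind the results above: the report does not say which package was used, so `rpart` is our assumption; the data are simulated stand-ins, not survey responses; and the experience group labels are hypothetical.

```r
library(rpart)  # assumption: the report does not name the tree package

set.seed(1)
n <- 300

# Simulated stand-in data (NOT the survey responses); scores run from 1 to 5.
toy <- data.frame(
  dem_sharself  = sample(1:5, n, replace = TRUE),
  dem_shardisc  = sample(1:5, n, replace = TRUE),
  dem_reuseself = sample(1:5, n, replace = TRUE),
  dem_reusedisc = sample(1:5, n, replace = TRUE),
  # Hypothetical experience group labels.
  dem_exprnce   = factor(sample(c("0-5", "6-15", "16-30", "31+"), n, replace = TRUE))
)

# Toy rule: sharing is more likely when personal willingness to share is high.
p_shared <- ifelse(toy$dem_sharself >= 4, 0.9, 0.4)
toy$dem_shared <- factor(ifelse(runif(n) < p_shared, "Yes", "No"))

# Fit a classification tree on the 5 predictors listed in the text.
fit <- rpart(dem_shared ~ dem_sharself + dem_shardisc + dem_reuseself +
               dem_reusedisc + dem_exprnce,
             data = toy, method = "class")

# Predicted class per respondent, cross-tabulated against the simulated truth.
pred <- predict(fit, newdata = toy, type = "class")
print(table(observed = toy$dem_shared, predicted = pred))
```

Printing `fit` lists each split with its node size and class proportions, which is how splits and percentages like the ones quoted in the conclusion are usually read off.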
Due to the formula used to compute the average sharing and average reuse scores, a “0” is given for responses that are “Not Asked”. This results in a lower score computed in both cases.
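The size of this effect is easy to see on a single made-up respondent. The sketch below is hypothetical (the scores are invented for illustration) and compares the mean with the “Not Asked” item counted as 0 against the mean with that item treated as missing.

```r
# One hypothetical respondent: three answered items and one
# "Not Asked" item that was coded as 0.
scores <- c(5, 4, 0, 4)

# The 0 is averaged in, pulling the score down.
mean_with_zero <- mean(scores)

# Treat the "Not Asked" item as missing and skip it instead.
scores_na    <- replace(scores, scores == 0, NA)
mean_skipped <- mean(scores_na, na.rm = TRUE)

print(c(with_zero = mean_with_zero, skipped = round(mean_skipped, 2)))
```

Here the score drops from about 4.33 to 3.25 purely because of the coding, not because of anything the respondent reported.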
With regards to disciplines and challenges, denoting them as binary indicators might not reflect the whole spectrum of information available, since one challenge might pose bigger problems than another and they are not weighted equally in reality. The same could be said for the complexity or difficulty of the disciplines involved.
The following examines open response question Q12, asking the respondents to specify any information they consider when deciding whether to use or not to use secondary data:
Conclusion:
According to the frequency of the words that appeared in the open responses, the results highlight common words such as reliability, free/cost, time period/date of data, and relevance. Therefore, when deciding whether or not to use secondary data, researchers tend to have these factors in mind.
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reusegrp
## X-squared = 2.5545, df = 3, p-value = 0.4655
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.8029197 0.7463002 0.7450425 0.7766497
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reusedisc
## X-squared = 2.9936, df = 3, p-value = 0.3926
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.7482993 0.6749522 0.6813472 0.6830357
##
## 4-sample test for equality of proportions without continuity
## correction
##
## data: reuse_counts$dem_reuseself out of reuse_counts$dem_reuseorg
## X-squared = 3.2616, df = 3, p-value = 0.353
## alternative hypothesis: two.sided
## sample estimates:
## prop 1 prop 2 prop 3 prop 4
## 0.7284768 0.6585821 0.6608040 0.6923077
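Each of the three tests above compares four group proportions, and none of the p-values (0.4655, 0.3926, 0.353) falls below 0.05, so the tests give no evidence that the proportions differ between the groups. A minimal sketch of this kind of test follows, using base R's `prop.test` with hypothetical counts, since the survey's actual counts are not shown in the output above.

```r
# Hypothetical counts for four groups (NOT the survey's counts):
# number of "successes" out of the number of respondents in each group.
successes <- c(80, 150, 149, 78)
totals    <- c(100, 200, 200, 100)

# Chi-squared test that all four underlying proportions are equal.
res <- prop.test(successes, totals)

print(res$estimate)   # the four sample proportions
print(res$parameter)  # degrees of freedom (number of groups minus 1)
print(res$p.value)
```

A large p-value only means the data are compatible with equal proportions; it does not show that the groups are identical.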
We investigate the differences in how students, researchers, and support professionals value information when deciding whether to use secondary data.
The majority of students indicate that data collection conditions and methodology, reputation of data source, metadata and documentation, how the data was processed and handled, and topic relevance are important factors when deciding to use secondary data.
It appears that information such as data collection conditions and methodology, topic relevance, correct coverage and how the data was processed and handled were the most important for students when they decided whether to use secondary data. Conversely, personally knowing the data creator was not important for nearly 45% of the students and less important for nearly 30% of students. This could indicate that students care somewhat more about the quality of the data itself than about the data creator or source.
Researchers indicated reputation of data source, collection conditions and methodology, ease of access, how the data was processed and handled, and topic relevance to be important factors when deciding whether to use secondary data.
It appears that researchers care about similar information to students when deciding whether to use secondary data, with collection conditions and methodology being the most important, followed by how the data was processed and handled and topic relevance. As with the students, personally knowing the data creator is the least important factor, with around 25% of researchers indicating it is not important and around 23% indicating it is less important. Researchers appear to care more about the quality of the data itself than about the data creator, size, format or the original intent of the data.
Support professionals indicated data collection conditions and methodology, detailed metadata and documentation, ease of data access, how the data was processed and handled, and topic relevance to be important when deciding whether to use secondary data.
It appears that support professionals consider detailed metadata and documentation to be important information when deciding whether to use secondary data. Collection conditions and methodology, topic relevance, ease of access, and how the data was processed and handled are also important information for support professionals. While data quality is still important, support professionals appear to place more importance than students and researchers on information that aids in accessing and understanding data. This makes sense, as support professionals likely find data for clients who are the ones who actually use the data. While support professionals are responsible for finding good quality data, they also consider information that would make accessing, understanding or explaining the data easier.
We investigate the differences in how students, researchers, and support professionals establish trust in data.
Transparency in data collection methods, lack of errors in the data, and having prior usage of the data are the factors that are important for students when establishing trust in data.
It appears that students believe that data transparency and accuracy are important factors when establishing trust in data. This is evident from the percentage of students who indicated extremely important and important for lack of errors (over 30% and over 50%) and for transparency in data collection methods (nearly 50% and nearly 40%). There is also a large percentage of students who indicate prior usage as important (nearly 60%) in establishing trust in data, but the percentage indicating it is extremely important is low compared to other factors that appear to be important. This could indicate that while having prior usage is helpful in establishing trust, it is not as important as the accuracy, transparency and even ease of accessing data for students.
Lack of errors, prior usage, transparency in data collection methods and reputation of data source are all important factors for researchers when establishing trust in data.
It appears that researchers, like students, value factors dealing with data transparency and accuracy when it comes to establishing trust in data. The percentage of researchers who indicated transparency in data collection methods and lack of errors in data as extremely important (just over 50% and around 37%) surpasses the percentage of students who indicated the same. It appears that the reputation of the data source and ease of access are more important among researchers than students in establishing trust in data. This could indicate that researchers, who could be producing more important or impactful work than the majority of students, value not only data transparency and accuracy but also the reputation of their source and the ease of accessing data when determining whether to trust the data.
Support professionals consider ease of access, transparency in data collection methods, and the reputation of data source to be important factors when establishing trust in data.
It appears that research support professionals place more importance on ease of access and reputation of the data source than researchers or students. This is reasonable, as research support professionals may not be experts in the niche fields in which their clients may request data or aid. Thus, factors such as a reputable source or protected data could be interpreted as the data being more trustworthy. Transparency in data collection methods remains the factor with the greatest percentage of respondents indicating extremely important, which could mean that all respondents (students, researchers and support professionals) regard data collection methods as the most important factor when deciding whether to trust data. Across all respondents, a personal relationship with the data creator is indicated as not important in establishing trust in data.
We investigate below the differences in how students, researchers and support professionals establish the quality of data.
Factors such as lack of errors in data, data preparation, clarity, completeness, and reputation of data source are all important in establishing quality of data for students.
Similar to how students establish trust in data, it appears that factors regarding the data itself (the lack of errors, data resolution or clarity, data completeness) and the reputation of the source play an important part for students. The lack of errors in data is extremely important for nearly 50% of students and important for nearly 40% of students. The reputation of the data creator and data size are less significant for students: for each of these factors, over 20% indicated less important and over 10% indicated not important.
Lack of errors, data preparation, clarity and completeness all remain important in establishing quality of data for researchers.
It appears that many factors that were important to students, such as lack of errors in data, are also important for researchers when establishing the quality of data. It is interesting that researchers appear to value the reputation of the data creator and of the data source more than students do: over 45% of researchers indicated extremely important or important for reputation of data creator compared to around 35% of students, and around 70% of researchers did so for reputation of data source compared to around 60% of students. The importance of data preparation appears to decrease for researchers compared to students (around 16% and 42% of researchers indicating extremely important and important respectively, compared to around 16% and 50% of students). This could indicate that students place greater value on how “clean” or “prepared” data is when establishing data quality, while researchers value the reputation of the data origin. This could potentially indicate that students are more willing to use data that are created or collected by less reputable sources.
Support professionals share some similarities with students and researchers but place less value on data size and ease of downloading and exploring when establishing quality in data.
Support professionals, like students and researchers, value factors such as lack of errors, data clarity, consistency and completeness when establishing data quality. However, it appears that support professionals are not as certain that any specific factor is exceedingly important as students and researchers were about lack of errors. Data size appears to be less valuable for support professionals when establishing quality of data, with over 50% indicating that factor is less important or not important. Overall, support professionals appear to be decided on consistency in formatting, data completeness, lack of errors, and resolution or clarity of the data as important factors in establishing data quality, and less certain about other factors, as seen by the greater percentages indicating somewhat important compared to students or researchers.
##
## Data sharing is neither encouraged nor discouraged
## 151
## Data sharing is somewhat discouraged
## 30
## Data sharing is somewhat encouraged
## 440
## Data sharing is strongly discouraged
## 10
## Data sharing is strongly encouraged
## 972
## Don't know/ Not applicable
## 27
##
## Data sharing is neither encouraged nor discouraged
## 0.092638037
## Data sharing is somewhat discouraged
## 0.018404908
## Data sharing is somewhat encouraged
## 0.269938650
## Data sharing is strongly discouraged
## 0.006134969
## Data sharing is strongly encouraged
## 0.596319018
## Don't know/ Not applicable
## 0.016564417
##
## Data sharing is neither encouraged nor discouraged
## 289
## Data sharing is somewhat discouraged
## 105
## Data sharing is somewhat encouraged
## 612
## Data sharing is strongly discouraged
## 34
## Data sharing is strongly encouraged
## 538
## Don't know/ Not applicable
## 52
##
## Data sharing is neither encouraged nor discouraged
## 0.17730061
## Data sharing is somewhat discouraged
## 0.06441718
## Data sharing is somewhat encouraged
## 0.37546012
## Data sharing is strongly discouraged
## 0.02085890
## Data sharing is strongly encouraged
## 0.33006135
## Don't know/ Not applicable
## 0.03190184
##
## Data sharing is neither encouraged nor discouraged
## 361
## Data sharing is somewhat discouraged
## 121
## Data sharing is somewhat encouraged
## 618
## Data sharing is strongly discouraged
## 32
## Data sharing is strongly encouraged
## 434
## Don't know/ Not applicable
## 64
##
## Data sharing is neither encouraged nor discouraged
## 0.22147239
## Data sharing is somewhat discouraged
## 0.07423313
## Data sharing is somewhat encouraged
## 0.37914110
## Data sharing is strongly discouraged
## 0.01963190
## Data sharing is strongly encouraged
## 0.26625767
## Don't know/ Not applicable
## 0.03926380
##
## Data sharing is neither encouraged nor discouraged
## 404
## Data sharing is somewhat discouraged
## 104
## Data sharing is somewhat encouraged
## 546
## Data sharing is strongly discouraged
## 34
## Data sharing is strongly encouraged
## 430
## Don't know/ Not applicable
## 112
##
## Data sharing is neither encouraged nor discouraged
## 0.24785276
## Data sharing is somewhat discouraged
## 0.06380368
## Data sharing is somewhat encouraged
## 0.33496933
## Data sharing is strongly discouraged
## 0.02085890
## Data sharing is strongly encouraged
## 0.26380368
## Don't know/ Not applicable
## 0.06871166
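The proportion tables above are the count tables divided by the number of respondents. As a small check, the sketch below recomputes the first proportion table from the first count table; the vector name `share_counts` is ours, and the shortened labels are for readability only.

```r
# Counts copied from the first count table above (labels shortened).
share_counts <- c(
  neither_encouraged_nor_discouraged = 151,
  somewhat_discouraged               = 30,
  somewhat_encouraged                = 440,
  strongly_discouraged               = 10,
  strongly_encouraged                = 972,
  dont_know_not_applicable           = 27
)

total       <- sum(share_counts)
share_props <- share_counts / total
print(round(share_props, 4))

# Combined share of "somewhat" and "strongly" encouraged responses.
encouraged <- share_props[["somewhat_encouraged"]] +
  share_props[["strongly_encouraged"]]
print(round(encouraged, 3))
```

On the raw responses, `prop.table(table(x))` gives the same proportions directly. In this first table, the two encouraged categories together account for about 87% of the 1630 respondents.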
The following looks into open response questions Q10a and L12a, asking respondents to discuss how their process for finding data is different than the process for finding academic literature.
Conclusion:
The responses to the question asking how the process for finding data is different than the process for finding academic literature included many words that showed up more than others (outside of data and literature which were part of the question). This included words like search, repositories, resources, and specific. Unlike finding literature, the results may suggest that many respondents may visit different repositories and are required to employ different search processes when finding data. The results did not appear to have any obvious patterns or differences between all the responses versus only from the respondents who answered yes.
Conclusion:
Similar to the support dataset, common words that were present in this open response question include ones such as search, specific, and sources. This may suggest that one of the reasons why researchers engage in a different process in finding data is because data requires specific sources (contacting data creator, specialized databases, etc) that are found through ways different from finding academic literature. For example, one respondent suggests that “finding literature is by google search, finding data has many more ways.”
From the support dataset, we learn that research support professionals were generally in the middle stages of their career, with a vast majority specializing in the natural and applied sciences and employed in either Europe or North America. The majority of researchers and research support professionals need observational or empirical data, and they use or need secondary data for either a new study or for teaching or training. Finding data primarily happens by actively searching online and through sources such as conversations with personal networks and mailing lists. Moreover, by analyzing the open response questions, we see that reliability, cost, date of data, and relevance influence researchers’ decision on whether or not to use secondary data.
From the researcher dataset, we found that an important factor in determining data share and (re)use behavior is the experience level of the respondents. Respondents with more experience indicated they find data themselves less than respondents with fewer years of experience. Additionally, challenges to finding data pertaining to having the necessary resources and connections also decrease as the years of experience increase. The challenge of data being in many different places increases with experience level, perhaps indicating that, with more access to available data, respondents with more experience find it more difficult to locate specific or targeted data. This could point to a broader issue with the organization of data beyond just the accessibility of data.
Students indicated finding data was more challenging than respondents who identified themselves as researchers did. Additionally, more students than researchers indicated lack of personal networks and inaccessible data to be challenges.
When it comes to the limitations of our analyses and the dataset, there are a couple that should be noted. First of all, the support dataset was relatively small, with only 47 respondents. This means that the limited number of responses to certain questions and analyses of various demographic variables must be taken into account. In addition, because of a low response rate to the survey, the potential for nonresponse bias must be considered, as responses may not be representative of the larger target population. As mentioned before, the dataset also consists of categorical data, limiting our ability to carry out further analyses or inferences.
This limitation aligns with the potential next steps, as we would welcome additional survey questions that gather quantitative responses from the researchers and support professionals. Additionally, a suggested future area of research could pertain to the organization of data in more specific sources and areas. From the analysis on challenges in finding data, we found that respondents with more experience had less difficulty accessing data compared to respondents with less experience but had a greater proportion indicating data to be in many different places compared to less experienced groups. Another suggested area of future work is in the challenges of finding data among students. From our analysis of the ease of finding data among students and researchers, we discovered that students overall indicated finding data is difficult, with one of their challenges being that data is not accessible. The question of whether data is not accessible because students lack the necessary experience to obtain the data they desire, or because they are not aware of the data sources and data-finding resources available to help them, could be explored further.